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Abstract 

Java implementations of algorithms used by spreadsheets to au- 
tomatically recompute the set of cells dependent on a changed cell 
are described using a mathematical model for spreadsheets based on 
graph theory. These solutions comprise part of a Java API that allows 
a client application to read, modify, and maintain spreadsheet data 
without using the spreadsheet application program that produced it. 
Features of the Java language that successfully improve the running 
time performance of the algorithms are also described. 



1 Introduction 



This paper describes algorithms for the recomputation of spreadsheet cells. 
The assumed context for such a recomputation occurs when a cell's value is 
changed. In general, a cell is dependent on several others for its value as 
defined by its formula. Thus, to maintain the integrity of the spreadsheet, 
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the reading of a cell value requires the recomputation of this cell once any of 
the cells on which it depends has changed. 

The algorithms of this paper form the basis of ExcelComp jU], a Java 
application program interface (API) written by the author that al- 
lows the client application to read a specially formatted Microsoft Excel [3] 
(henceforth referred to as "Excel") spreadsheet output file, and then make 
changes to cell values within the ExcelComp representation of this spread- 
sheet. Changes to cell values are followed by the automatic recomputation of 
dependent cell values using ExcelComp methods. ExcelComp thus allows the 
client programmer to provide its users with both the data and behavior of an 
existing spreadsheet without the use of the original spreadsheet application 
program that produced it. 

During the development of ExcelComp, it was realized that choices of 
algorithms to perform cell recomputation involve two principal trade-offs: 1) 
ease of use, and 2) running time performance. On the one hand, one mode of 
ExcelComp can simply load a file at run time that represents the spreadsheet, 
and then provide its services. While this mode is satisfactory for many tasks, 
it is unsuitable for those that require a large number of cell recomputations 
to support dynamic updates to real-time outputs, for example, the updating 
of an on-screen map that depends on thousands of cell recomputations. To 
support this latter task, a second mode was developed that allows faster cell 
recomputation at the expense of a less convenient installation procedure for 
the spreadsheet representation. 

These considerations make ExcelComp an efficient, platform- and vendor- 
independent Java API that provides built-in spreadsheet emulation for ap- 
plication end-users. In particular, end-users are relieved of the burden of 
conducting their spreadsheet tasks outside the domain of their running ap- 
plication. In addition to the efficiency won by executing spreadsheet tasks 
natively, ExcelComp also obviates the need for costly additional licenses re- 
quired for multiple users of the application software that produced the spread- 
sheet. Being written in the modern Java programming language allows the 
client programmer to easily integrate ExcelComp's functionality into current 
software development efforts. 

While other descriptions of spreadsheet algorithms are available ^H] [Hj j 
this paper is distinctive in its use of graph theory to improve the reader's 
ability to visualize the algorithms, and to provide a basis for a proof of 
algorithm correctness. It also presents solutions that leverage features in the 
object-oriented Java API that lead to succinct, yet powerful code. 
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This paper focuses on the subject algorithms and the specific features of 
the Java language used by ExcelComp that are well-suited for their imple- 
mentation. Readers interested in a more detailed specification of ExcelComp 
from the client programmer's perspective may contact the author. 

2 A Scenario 

Before getting into the technical details that comprise this report, it would 
be helpful to consider a motivational scenario. 

Consider a spreadsheet of financial data, where subtotals, interest earned, 
and a grand total might be some examples of computed quantities that each 
depend on entries in several cells. Analysts may use such a spreadsheet to 
play "what-if" games by varying values in cells that will affect some target 
cell, such as interest earned. The spreadsheet program would then auto- 
matically recompute all cells that are dependent on the ones changed. This 
capability of a spreadsheet is its hallmark, and distinguishes it from a simple 
table of values that have no computational relationship to one another. 

Suppose that a computer program needs this spreadsheet of information 
and auto-update capability to carry out its tasks. This program is to provide 
its users with the what-if capability, and therefore requires not only cell 
values, but also cell formulas. Since it needs to emulate the recomputation 
function of the spreadsheet, it must implement algorithms that return the 
same recomputed values as the spreadsheet program. It is these algorithms 
of dependent cell recomputation that are the subject of this report. 

3 Modes 

ExcelComp has two modes of operation: 

Interpreted mode requires the reading of an extensible Markup Language 
(XML) [2] representation of a spreadsheet. Once this file is parsed and 
loaded into ExcelComp's data structures, the subject algorithms are 
implemented via ExcelComp methods. It is called interpreted, because 
cell formulas are interpreted at run time using a custom parser that 
recognizes a subset of Excel's formula language. 
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Compiled mode uses cell-specific Java classes, created as an offline pre- 
processing task, to evaluate a cell by recursively evaluating each child 
cell referenced in its formula's parse tree. 

The interpreted mode is the slowest of the two. It has the advantage, how- 
ever, of requiring less preprocessing, namely just the creation of the XML 
input file. During the development of ExcelComp, this XML file was pro- 
duced by running an Excel macro jS]. It is also easier in this mode for the 
client programmer to provide the application user the flexibility to apply Ex- 
celComp to a different spreadsheet by simply changing a filename reference 
in ExcelComp's constructor. The comprehensive update of all dependent cell 
values upon the change of a constant cell in this mode allows the ExcelComp 
user to highlight the newly computed dependents. Such an application was 
developed by the author, where changed cells are shown in a JTable [22] with 
changed values highlighted in red to allow the user to gain insight into the 
impact of the change of a cell value. 

The compiled mode is much faster than the interpreted mode, and should 
be preferred in cases where a client demands exceptionally high execution 
time performance, e.g., providing a real-time screen update that depends 
on the recomputations. The preprocessing needed for compiled mode in- 
cludes the generation of Java source code that implements the formulas of 
the spreadsheet. This source code generation was automated by using a Java 
class that uses classes produced by parser generators [TJ H3] to implement a 
custom parser for a subset of the Excel formula language. In general, Java 
classes that represent each cell formula in the spreadsheet must be created, 
and then be referenced by the classpath option for the Java virtual machine 

(jvm) psj. 

4 Computation Model 

To provide a lingua franca for the discussion to follow, we need to identify 
parts of a spreadsheet that are useful to us. While the intent is to have a 
model that is generic, platform-, and vendor- independent, the use of Excel 
as a reference implementation for ExcelComp influenced the latter's design. 
The language of this paper will be similarly influenced, however it is germane 
to any spreadsheet that adheres to the computation model described here. 
While a queue-based computation model has been successfully developed 
we will find it advantageous to develop a model using graph theory. 
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In particular, proof of correctness of the algorithms can benefit from such a 
treatment. 

There are many ways to present data using a spreadsheet. For example, 
two principal classes of representation provided by Excel are the workbook, 
and the chart. We consider only the tabular computation environment found 
in a workbook. The term spreadsheet will thus be used as a synonym for 
workbook. 

A spreadsheet is a finite set of cells arranged as a matrix. A cell is a 
set that contains three elements of interest: 

1. a value, 

2. a formula, and 

3. a cell reference. 

A cell's value is the result of the computation specified by its formula. 
In general, this value may be a real number, a string, or some other data 
type. To simplify this model, we will assume that these values are real. A 
cell's formula is an expression that defines a cell's value as a function, /, of 
a subset of the spreadsheet's cell values. Let C denote the set of cell values 
for some spreadsheet. More formally then, we have 

/ : C n - C, (1) 

where C n is the n-fold Cartesian product of C for some positive integer n. 
For this model, we define C — R. In general, a formula expresses a composite 
of functions in the form (JTJ). A formula that is not composite has constant 
values for its arguments. Such a formula is termed a constant formula. 

A cell reference is an ordered pair that specifies a cell uniquely within 
the spreadsheet. The Excel "Al reference style" [S] will be used, where the 
first element specifies the column, and the second element specifies the row. 
For example, B3 designates the cell at the intersection of the second column 
and the third row. We use Xi to denote a variable whose value is a cell 
reference. In the context of a formula, a cell reference is mapped to its 
corresponding cell value according to ([I]). We thus see that, in general, a cell 
reference refers to a composite function that is defined by the formula for 
that cell. For example, if cell Ml depends on Nl and P2, and Nl depends on 
Q3, and P2 depends on Rl, then the value of Ml expressed as a composite 
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function is M1(N1(Q3),P2(R1)). For this expression to be fully resolved, the 
formulas for both Q3 and Rl must be constant formulas. 

To successfully develop the subject algorithms, the scope of the set of 
spreadsheets to be considered must be defined. This will be done by identi- 
fying properties that serve as axioms for the spreadsheets of interest. Spread- 
sheets that satisfy the stated properties are termed admissible. 

There is a cohesive relationship between a cell's formula and its value, as 
described in the following property. 

Property 1. A cell's value is completely determined recursively by its for- 
mula. 

To be clear, Property Q states that a cell's value depends only on its 
formula, but that formula in general depends on other cell values that are, in 
turn, dependent only on their formulas. This recursion ends when a cell with 
a constant formula is reached, thus resolving all of the recursive cell value 
references. 

The formulas of a spreadsheet define a potentially involved relation among 
its cells. Given cell Xi, its formula may be a constant formula, or a non- 
constant formula that defines Xi's value as a function of other cell values in 
the spreadsheet. In the latter case, we say that Xi is dependent on the cells 
referenced in its formula. That is, the value of Xi depends on the values of 
the cells referenced in its formula. The term "value" will often be omitted 
when the context of "dependent" is clear. We also refer to each cell referenced 
in Xi's formula as a child of Xi. Similarly, Xi is a parent of its children. A 
parent and any parent of a parent is termed an ancestor. A child and any 
child of a child is termed a descendant. 

The relation among the dependent cells in a spreadsheet may be repre- 
sented as a weakly connected directed graph, G(V, E), where V is the set of 
vertices, and E is the set of edges. Figure Q illustrates this spreadsheet de- 
pendency graph with an example that will be used throughout this paper. 

Each vertex of the graph is a cell represented as a box. The cell refer- 
ence is given at the top of each cell box, and its corresponding formula is 
given in smaller type at the bottom of the box. It is understood that the 
value of a cell is assigned the value computed by its formula. The cell values 
are omitted from the cell boxes in Figure for brevity. Note that CI has 
two parents: El and Fl. This lack of a unique parent in general for each 
cell precludes regarding this structure as a rooted tree. Although the more 
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Figure 1: Example of a Spreadsheet Dependency Graph 

general graph is the appropriate data structure for representing cell depen- 
dencies in a spreadsheet, we will find that rooted trees will also be useful in 
the algorithms to be described. 

In describing Figure d some of the basics of graph theory, using jH] as a 
guide, will be described as needed. 

In Figure ^ the set of vertices V are the cells, and the set of directed 
edges E is defined according to the parent/child relationships. Each edge 
is directed from a parent to a child. The set E is a subset of the ordered 
pairs of V. Let a sequence of vertices be ordered such that v^i is a parent 
of Vi, and let t>j_ii>j denote the edge directed from to Vi. A sequence of 
edges and vertices that has the form {voV\, V1V2, ■ ■ ■ , v n -iv n } for distinct V{ is 
defined as a directed path linking vq and v n . A directed path is a directed 
cycle if it consists of 2 or more vertices, and v n — Vq. 

We are now led to an important stipulation concerning spreadsheets. 

Property 2. An admissible spreadsheet contains no directed cycles. 

The Excel term for directed cycle is circular reference. Property 121 thus 
states that circular references are prohibited. 

A graph from the subset of graphs just described is termed a directed 
acyclic graph or dag section B.4]. 

Readers familiar with the GNU Make tool [2H1 will recognize this depen- 
dent cell recomputation problem as being analogous to the problem solved by 
Make: automatic determination of the pieces of code that require recompila- 
tion, and issuing the appropriate commands to bring the program up to date. 
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Make uses a dependency graph model. See for details and illustrations 
of Make's dependency graphs, including a description of pitfalls concerning 
the proper use of Make to ensure correct dependency graph construction in 
large projects. 

5 Interpreted mode 

This section describes those algorithms that are implemented in the inter- 
preted mode of ExcelComp. The integrity of the spreadsheet is preserved 
in this mode by recomputing all dependent cells of a cell whose constant 
formula (value) has changed. This behavior ensures that, upon the commit- 
ment of a new value to a cell, the entire spreadsheet will be updated to reflect 
the change. This matches the default behavior of Excel, where it is termed 
automatic calculation. Pseudocode is given in this section to highlight the 
salient features of the ExcelComp interpreted mode algorithms; the actual 
code differs in some of the implementation details. 

5.1 Dependency Set Generation 

Suppose that we are examining a spreadsheet for the first time, and have 
no a priori knowledge of its contents. Say we want to modify cell Al. By 
this, we mean that Al has a constant formula that is to be changed to 
another constant formula. The more general act of modifying or adding a 
non-constant formula will not be discussed here; it is assumed that non- 
constant formulas remain fixed throughout our analysis. 

Consider the impact that this change has on the cells that are dependent 
on Al in Figure ^ First, this change in the formula causes a recomputation 
of Al's value. This change will, in general, affect the values of all cells whose 
formulas reference Al. These cells, Bl and CI, are directly dependent 
on Al, and will need to be recomputed as a result. In general, the values 
of these direct dependents will change as a result of recomputation. These 
direct dependents must then be considered in the same light as Al; that is, 
we need to find and recompute the direct dependents of the direct dependents 
of Al. These cells are El and Fl. From the point of view of Al, these latter 
cells are indirect dependents of Al. 

The algorithm for discovering the set of dependent cells of a given cell is 
thus recursive. Let d be a set-valued function d : 2 C — > 2 C that computes 
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the set of direct dependents of a subset of cells from C. (Here, 2 C denotes 
the set of all subsets of C, also known as the power set.) The procedure just 
described can now be expressed as 

A i+1 = d(Ai), A t 6 2 c , A = {A1}. (2) 

A is set to {Al} in (0) to reflect Figure [TJ but in general it will be 
assigned to the set containing the cells that were changed. 

Dependency Set Generation may be recognized as an implementation of 
the breadth-first search (BFS) algorithm for graphs described in [3]. The 
"frontier between discovered and undiscovered vertices" described in |3] ap- 
plied here divides two generations of dependencies, i.e., child/parent, par- 
ent/grandparent, etc. Also, note that we begin with a child, and then dis- 
cover ancestors, in reverse to the naming convention used in 

For recurrence relation © to be practical, we must be assured that it 
terminates. Indeed, an essential property of an algorithm is its finiteness; 
according to ^H] , 'An algorithm must always terminate after a finite number 
of steps." This assurance is given now as a theorem. 

Theorem 1. Recurrence relation (J2J) terminates. 

Proof. Because the indices of Ai in the recursion are strictly increasing, it is 
sufficient to show that 3z maa; ^ i < imax- 

Assume that the spreadsheet under consideration has N cells. Let \\Ai\\ 
denote the number of parents of Because of Property 121 the number 

of candidate parents for A is N — 1, since a cell cannot be a parent of itself 
(thereby creating a circular reference). Similarly, the number of available 
parents for A\ is at most N — 2, since both the children and grandchildren 
of A 2 must be excluded to avoid circular references. In general then, 

II All < N - i, 

and in particular, ||Azv|| = 0. At this point, no parents are available to 
continue further. We have thus shown that i < N; that is, i max = N. □ 

The final product of Dependency Set Generation is formed by taking the 
union of the sets of dependent cells found in ((21). Assuming that the final 
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index computed in (J2J) is n, and letting V be the set of dependent cells, we 
have 

n 

v = \Ja, Ae2 c (3) 



Input: Set of cells for which dependent cells will be found 
Output : Set of cells that are dependent on the input 

set of cells 
DepSetGen (depSet , m) 
{ 

initial_size = depSet . size () ; 
for (k=m through initial_size-l ;k++) { 
for (j=0 through SPRSHEET_SIZE-1 ; { 
// Find all direct dependents of depSet [k] . 

if (sprsheet [j] . formula contains depSet [k] . ref) { 
depSet . add (sprsheet [j] .ref); 

> 

} 

DepSetGen (depSet , initial_size) ; 

} 

if (m == 0) { 

depSet . delete (depSet [0] ) ; 

} 

} 

Figure 2: Dependency Set Generation Algorithm 



5.1.1 Example 

The Dependency Set Generation algorithm is codified in Figure El Let us 
apply this algorithm to finding all dependents of Al in Figure 

Assume the number of cells in the spreadsheet, SPRSHEET_SIZE, is 6. 
The array sprsheet holds all the cells in the spreadsheet. Each element 
of sprsheet has the fields ref and formula for cell reference and formula, 
respectively. The array depSet will be built up to contain the cell references 
of all of the dependents of its initial value. 
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We make the initial call to DepSetGen with depSet initialized to contain 
Al, and the depSet element marker m set to 0. This marker's value is the 
index of the cell in array depSet whose dependents are sought. Variable 
initial_size is set to the number of elements in depSet. Since depSet 
contains only Al, initial_size = 1. Loop counter k ranges from through 
0. The inner loop checks to see whether any spreadsheet cell formula contains 
Al. If it does, the cell reference is added to depSet. At the time where 
the inner loop is finished, both Bl and CI are appended to depSet. At 
the bottom of the outer loop, it is time to make the first recursive call to 
DepSetGen. 

The actual parameters passed to DepSetGen are the newly updated 3- 
element depSet, and the initial size of depSet before the loops, 1. Now 
entering the first recursive call of DepSetGen, m is 1, initial_size is 3, 
depSet consists of Al, Bl, and CI. Loop variable k ranges from 1 through 
2. The first time through the inner loop finds all dependents of Bl, and adds 
the one dependent found, El, to depSet. When k is 2, the inner loop finds 
all dependents of CI, and adds the one dependent found, Fl, to depSet. 

Upon the next recursive call, the marker is set to the next unexamined 
element of depSet, El at index 3. No dependents are found. Similarly for 
Fl at index 4. Finally the outer loop is skipped, and control is eventually 
returned to the original call of DepSetGen, where m is 0. Lastly, the if 
statement is executed, and the initial element Al is removed from depSet, 
since Al is not dependent on itself. It is assumed depSet contains just one 
cell during the initial call to DepSetGen. Though not shown in Figure 01 an 
additional step of removing duplicate cell references from T> is required to 
ensure that all of its elements are unique. 

To conclude this section, it will be shown that the Dependency Set Gen- 
eration algorithm just described indeed finds all dependents of a given cell. 

Theorem 2. The Dependency Set Generation algorithm identifies all depen- 
dents of its input set of cells. 

Proof. The proof is by contradiction. Assume we have a spreadsheet with 
cells Ci, i = 0, 1, 2, . . . , N — 1. Suppose 3Ck G C that is dependent on C , but 
was not identified by the Dependency Set Generation algorithm. Then by 
Property [TJ this dependence of Ck on Co must be due only to CVs recursive 
formula. There must then be a directed path in CVs dependency graph from 
Ck to Cq. Since Ck was not identified, there is at least one cell Cj in this 
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path that was not found during the recursion. But this contradicts the step 
in the algorithm that says to find all direct dependents of Cj-\. □ 

5.2 Recomputation of Dependency Set Members 

Once the dependency set T> has been generated, the process of recomputing 
these cells can begin. 

To evaluate a cell, designated the original cell, each argument in its 
formula, the original formula, must be evaluated. This evaluation is in 
general a recursive procedure. By drawing a directed edge from the original 
cell to each cell referenced in the formula (one edge per cell in the formula), 
we get a rooted tree that is rooted at the original cell. By applying this 
algorithm recursively on each cell in the formula, we get several paths, each 
of which ends at a leaf having a constant-formula cell. The resulting rooted 
call tree is just a subset of the spreadsheet dependency graph. We may 
then start at the leaves of the tree to construct a string of formulas of the 
cells in a given path in the direction back toward the root. Each string 
is a self-contained sequence of formulas that allows the original cell to be 
evaluated. 

Lastly, once all the arguments in the original formula have been evaluated, 
the original cell can be evaluated. Those arguments in any formula that are 
not members of T> need not be recomputed; their current values can be used 
instead. 

The foregoing procedure is an implementation of the depth-first search 
(DFS) algorithm described in [3]. In this case, the use of the terms "prede- 
cessor" and "descendant" in [2j is consistent with our usage. However, we 
do not use "timestamping." 

We can see a lot in common here with the Dependency Set Generation al- 
gorithm. Once again, we see a recursive procedure being described, although 
it is not as easily expressible in one line as in (J2J). Instead, we will codify the 
algorithm in the pseudocode given in Figure EJ 

5.2.1 Example (continued) 

Continuing the example begun in the Dependency Set Generation section, 
consider again the dependency graph in Figure [TJ Suppose that El is to be 
evaluated. The well-known left-to-right, post-order tree traversal algorithm 
will be used to specify the order of evaluation of El's descendants. During 
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Input : Original Cell 

Output : Original Cell with newly computed value 
depth =0; // Initial value (global) 
EvalCell(cell) 
{ 

for (i=0 through cell .nchildren-1 ; i++) { 
depth++ ; 

EvalCell (cell . child [i] ) ; 

} 

parser_str . append(cell . ref + + cell. formula + ';'); 
if (depth == 0) { 

cell. value = parse (parser_str) ; 

} 

depth — ; 
return cell; 

} 

Figure 3: Dependency Set Evaluation Algorithm 



the first call to EvalCell, the recursion depth variable depth is initialized to 
0. In addition to the ref and formula fields, assume that a cell object also 
contains a field nchildren that gives the number of children in its formula 
field. The object cell also contains a child array, each of whose elements 
is a cell representing each child in its formula. The elements of child are 
stored in order of occurrence in its formula, element being the leftmost 
child, and element nchildren-1 being the rightmost. 

We thus begin by calling EvalCell with cell .ref set to El. The recur- 
sive evaluation of El's call tree begins with the leftmost cell reference in El's 
formula, Bl. Bl £ T>, and therefore must be recomputed. We then begin 
with the leftmost element of its formula, and find that it is the constant 1. 
This is a constant, and thus requires no further evaluation; we move to the 
next element, Al. Al is the changed cell, and its value is known, thus no 
further analysis is needed. 

We have now reached the end of Bl's formula, allowing us to compute its 
value. We thus go back up to El to process the next argument in its formula, 
CI. CI £ V, and therefore must be recomputed. The leftmost child of Cl's 
formula is Al, and has already been evaluated. Cl's next child, Dl ^ T>, and 
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thus does not have to be recomputed. Its current value of 10 is used. 

We have now recursively evaluated all of the children of El's formula, and 
completed the building of parser_str. This string may then be passed to a 
parser for evaluation of El. ExcelComp uses an LALR parser developed by 
the author using the tools JLex pP and CUP ^3]. LALR stands for LookA- 
head Left-to-right identifying the Rightmost production, and is described in 
[12] . For this example, the string is: 

A1=2;B1=1+A1;A1=2;D1=10;C1=A1+D1;E1=B1+C1; . 

Here, "=" stands for assignment, and ";" delimits each assignment. It 
is assumed that the parser stores values via the assignment statements, and 
that these values may be retrieved at points later in the parse string. The 
history of the construction of the parser string is summarized in Table ^ 
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1 
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1 
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B1 = 1+A1 


Al=2; 


2 


Dl 





Al=2 


B1 = 1+A1 


A1=2;D1 = 10; 


1 


CI 


1 


Al=2 


B1 = 1+A1 


A1=2;D1 = 10;C1=A1+D1; 





El 


1 


Al=2 
E1=B 


B1=1+A1 
1+Cl; 


A1=2;D1 = 10;C1=A1+D1; 



Table 1: History of Parser String Construction for El 

Note that there is a redundant assignment "Al=2" in parser_str. This is 
an example suggesting efficiency enhancements that can be seen to improve 
the performance of these algorithms. Improvements include, but are not 
necessarily limited to: 

1. Recompute a cell value only once. 
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2. Do not recursively evaluate cells that are not in D. 

Although these improvements surely are desirable for minimizing the 
number of algorithm steps, experience with their use in ExcelComp revealed 
that the time to run the additional code required to implement these im- 
provements largely cancels out the benefits of fewer cell evaluations. 

More ideas concerning the speed-up of interpreted mode are given in the 
Algorithm Complexity section. 

6 Compiled mode 

While the compiled mode agrees with interpreted mode with respect to the 
preservation of spreadsheet integrity, it uses a clever postponement of compu- 
tation technique to update a dependent cell's value just prior to the reading 
of its value via its accessor method. Using this deferred recomputation strat- 
egy, the execution time associated with recomputing those dependent cells 
whose values are never accessed is eliminated. This deferred recomputation 
is similar to Excel's manual calculation mode, where the user specifies when 
recomputation is to occur thus deferring immediate recomputation. It is also 
similar to the mark-sweep garbage collection algorithm [2*§] . 

Compiled mode is implemented by ExcelComp's use of the Cell API 
|27j . The highlight of this mode is preserving spreadsheet integrity while 
improving the running time performance of ExcelComp. Several techniques 
are used to meet this goal, including the use of: 

1. the Java Reflection API 

2. Hash containers, and 

3. Deferred recomputation of dependent cells. 
6.1 The Cell Class 

Cell is an abstract Java base class that provides a framework for modeling 
the cells of a spreadsheet, each of which is represented by a class derived 
from Cell named CellXi, where Xi denotes the Al-style reference to its 
corresponding cell. 

On its initial invocation, Cell's accessor class method getCell instanti- 
ates a CellXi object via the Java Reflection API '[2o] . This technique allows 
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a Java class to instantiate another class whose name is created at run time 
by the calling method. To improve the efficiency of subsequent accesses, 
Cell has a class variable workbook to reference a HashMap of references to 
previously instantiated CellXi objects. Similarly, each CellXi object has 
a HashSet named dependencies that contains references to CellXi objects 
that correspond to cells that are direct dependents (parents) of the CellXi 
object that owns dependencies. 

In a CellXi's constructor, each child's instance method addDependency is 
called to add Xi to that child's dependencies set. The use of getCell during 
this process causes each child's object, on its first access, to be initialized with 
its constructor. A DFS traversal of CellXi's call tree thus occurs so that each 
traversed parent appears in each of its children's dependencies set. The set 
of all such parents is the set of discovered ancestors. Both workbook and 
dependencies contain only minimal subsets of the full set of their respective 
data that describes the entire spreadsheet as determined by the history of 
Cell method calls. These subsets are updated as necessary, and suffice for 
computing correct results when recomputation of cell values is necessary. 

6.2 Preprocessing 

Figure |U details the differences in the preprocessing requirements for the two 
modes of ExcelComp. 

The preprocessing necessary for compiled mode begins with the same 
XML input file used in interpreted mode. A Java API named GenCell [TU] 
is run on this input file to produce a set of Java source files. Each source 
file defines a CellXi class that corresponds to a spreadsheet cell. GenCell 
uses a JLex/CUP-based parser similar to that used in interpreted mode to 
allow the translation of a supported subset of the Excel formula language 
into the appropriate Java statements. The important difference between the 
two parsers is in their output. In interpreted mode, the ExcelComp parser 
returns a string at run time that represents a newly-computed value. For 
ExcelComp's compiled mode, the GenCell parser returns Java source code 
that is written to a set of source files as an offline preprocessing task. Once 
GenCell has completed generating all of the source files, the Java compiler is 
run to compile these files into class files of executable bytecode. The Excel- 
Comp user must then ensure that the Java classpath contains the appropriate 
references to allow the client application to find the classes at run time. 
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Figure 4: Preprocessing requirements for ExcelComp modes 
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6.3 Dependency Set Generation 

The nature of the preprocessing performed in compiled mode allows Excel- 
Comp to handle Dependency Set Generation as a distributed, on-demand 
task rather than as an explicit set of steps conducted immediately after a 
cell value is changed as in interpreted mode. 

6.4 Deferred Recomputation of Dependent Cells 

Rather than requiring that the immediate recomputation of all members of 
T> occur after a cell's value has been changed, compiled mode defers the 
recomputation of any member of D until the client programmer requests its 
value via an ExcelComp accessor method. This algorithm saves considerable 
time when compared to its interpreted mode counterpart, seeing that the 
recomputation of many dependencies whose value is never sought is avoided. 

6.5 Getting a Value 

Each CellXi object's getValue method contains the Java encoding of the 
formula for cell Xi. The value of the cell is recomputed and stored as the 
instance variable val only when that object's boolean dirty flag is true; 
this flag thus allows this method to avoid unnecessary recomputation. When 
recomputation is unnecessary, getValue just returns the value of val. 

6.6 Setting a Value 

When a CellXi object's setValue method is called, its instance variable 
val is set to the desired value, and a DFS traversal of all of Xi's discovered 
ancestors is performed to ensure that each such ancestor's dirty flag is set to 
true. This ensures that all ancestors' values are recomputed on a subsequent 
call to an ancestor's getValue method. Note that recomputations are only 
done if an ancestor's getValue method is called, thus saving the time of 
recomputing ancestor values that may never be accessed. 

6.7 Example 

The following example illustrates the workings of compiled mode using Fig- 
ure [U Given the spreadsheet represented by this graph, we will use compiled 



18 



mode to set Al's value to 2, and then get the value of El. Observe that El's 
value with Al set to 1 is 13. By changing Al's value to 2, we expect the new 
value of El to be 15. 

The first statement given by the client program is: 

Cell . getCell ("Al") . setValue (2) ; 

The initial part of this statement, Cell .getCell("Al"), creates a new in- 
stance of CellAl, since one does not yet exist. This new instance is cre- 
ated using the forName class method in the package java.lang. Class [T9*] . 
CellAl's base class constructor CellO is first called to set CellAl's dirty 
flag to true. CellAl's val variable is set to 1 by its constructor. The last 
part of this statement calls Cell's setValue method to set val to 2. In 
general, setValue recursively marks all of Al's discovered ancestors as dirty. 
However, since no ancestors have yet been discovered, and thus no corre- 
sponding CellXi objects have yet been instantiated, no such marking occurs 
here. We are thus left with one instance of CellAl that is marked as dirty 
and has a value of 2. 

We now get the value of El. The appropriate statement is: 

Cell . getCell ("El") . getValue () ; 

The first part of the statement behaves in the same way as for Al described 
above, only now the newly created object is an instance of CellEl. In addi- 
tion, CellEl's constructor initiates a DFS traversal of all of El's descendants 
to update each descendant's HashSet dependencies. For this example, both 
CellBl's and CellCl's dependencies sets are updated by having a reference 
to CellEl added. Note that CellCl's dependencies set does not refer to 
CellFl since, although Fl is dependent on CI, it is not a descendant of 
El. Such a reference to CellFl need only be added if Fl is the subject of 
future method calls. The last part of this statement checks to see whether 
the instance of CellEl has been marked as dirty. Since CellEl was just 
constructed anew, it is marked as dirty and thus its value must be recom- 
puted (in this case, computed for the first time). The statement in CellEl's 
getValue method that accomplishes this is: 

this. val = Cell.getCellC'Bl") .getValueO + 
Cell . getCell ("CI") . getValue () ; 
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Since the objects for Bl, CI, and Dl were all marked as dirty during the 
DFS traversal in CellEl's constructor, each object's getValue method will 
recompute the value for that cell. Each object's getValue method resets that 
object's dirty flag to false after the recomputation. Subsequent accesses 
to CellXi values that have not been affected by a change to a descendant's 
value simply return the value stored in val with no recomputation necessary. 

7 Performance 
7.1 Tests 

ExcelComp was tested for its running time performance in the SATCOM 
Availability Analyst (SA2) Java application jjj. The addition of ExcelComp 
API calls to SA2's map display function was chosen for test due to its de- 
manding requirement that 8,518 data points be updated for an on-screen 
Mercator map in such a way that the user is not burdened by long wait 
times for a complete update of the map. Processing each of the data points 
required 3 calls to ExcelComp methods; 2 of these calls each changed a cell 
value from the input spreadsheet, and the last call read back a cell value 
of interest from the newly updated spreadsheet. The map display function 
was selected and run 10+ times in each mode to characterize ExcelComp's 
performance. Running times associated with the first invocation of the map 
function were greater than subsequent trials, and thus were considered out- 
liers and removed from the representative data. These larger values probably 
reflect JVM-related setup steps that are not required on subsequent trials. 

The tests were conducted on a Hewlett-Packard HP OmniBook 4150 B 
running under Microsoft Windows 98 on a Pentium III 650 MHz processor. 
SA2 was run using the Sun Microsystems JVM version 1.3. Running times 
were computed as the difference in the start and end times returned by the 
Java method System. currentTimeMillis () |23j . 

The results of the performance tests are summarized in Table 121 The 
sample standard deviation is computed as the positive square root of the 
unbiased sample variance. The large difference in performance between the 
modes highlights how compiled mode can provide a very acceptable perfor- 
mance level in a case where interpreted mode, requiring over 5 minutes to 
complete, would be unacceptably slow. In this particular case, it is essential 
that compiled mode be chosen to make the use of ExcelComp feasible. 
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Compiled Mode 


Interpreted Mode 


Average 


677 


347057 


Sample Standard Deviation 


116 


254 



Table 2: ExcelComp Running Time (milliseconds) Performance over 10 Sam- 
ples 

7.2 Algorithm Complexity 

The graph traversal algorithms that underlie ExcelComp are well known to 
be efficient. Both DFS and BFS have running times that are linear in the 
size of the graph's adjacency list. Specifically, BFS is 0(V + E), and DFS is 
Q(V + E) sect. 22.2, 22.3]. 

Interpreted mode does not construct an adjacency list, and could very well 
benefit from a redesign to create this list upon the loading of the spreadsheet 
in the ExcelComp constructor. Because this construction must take place 
at run time, the user would incur a one-time performance penalty for this 
initialization step. The absence of an adjacency list suggests that interpreted 
mode's running time is probably greater than the linear time cited above. 

Compiled mode, on the other hand, does use a variation of adjacency 
lists in CellXi's dependencies set. However, while an adjacency list stores 
references to children, dependencies stores references to parents of CellXi. 
Compiled mode incurs a setup penalty during the discovery of cells, but 
subsequent accesses to CellXi objects are more efficient through the use of 
dependencies and workbook. Its use of adjacency lists, DFS traversals, and 
the efficient Java collections framework suggests that compiled mode has a 
running time that is close to 0(V + E). 

8 Conclusion 

We have seen that a graph representation of spreadsheet cell dependencies 
provides insight into the requirements of the algorithms used for the au- 
tomatic recomputation of dependent cells. Straightforward implementation 
of well-known graph traversal algorithms suffices for correct recomputation, 
however the adaptation of a well-studied garbage collection algorithm along 
with facilities made available in the Java language enable client programs to 
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run much faster, given some additional preprocessing. 

The client programmer should choose the mode of ExcelComp according 
to an analysis of the application's run time requirements and the tradeoffs 
between the modes as described in this paper. 
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